Objective Many applications, such as biomedical signal analysis, require selecting a subset of the input features in order to represent the
whole set of features. A feature selection algorithm has recently been proposed as a new approach for feature subset selection.

Methods A feature selection process using ant colony optimization (ACO) for 6-channel pre-treatment electroencephalogram (EEG)
data from theta and delta frequency bands was combined with a back propagation neural network (BPNN) classification method for 147
major depressive disorder (MDD) subjects.

Results BPNN classified responder (R) subjects with 91.83% overall accuracy and 95.55% detection sensitivity. Area under the ROC curve
(AUC) increased from 0.8531 to 0.911 after feature selection. The optimization algorithm selected Fp1, Fp2, F7, F8, and F3
for the theta frequency band and eliminated 7 features, reducing the feature set from 12 to 5.

Conclusion The ACO feature selection algorithm improves the classification accuracy of BPNN. Using other feature selection algorithms
or classifiers to compare the performance of each approach is important to underline the validity and versatility of the designed
combination. Psychiatry Investig 2014;11(3):243-250
INTRODUCTION

Reduction of pattern dimensionality using feature extraction
is one of the most important steps in the classification process.
Feature selection also has considerable importance in areas such
as bioinformatics, 1-3 signal processing, 4-8 image processing, 9-11
text categorization, 12 data mining, 13 pattern recognition 14-18 and
medical diagnosis. 19,20 The aim of feature selection is to choose
a subset of available features by eliminating less important or
unnecessary features. To extract as much information as possible
from a given set while using a smaller number of features,
the features with little or no predictive information are to be
eliminated, and strongly correlated redundant features are to be
ignored. 21 Thus, a large amount of computation time can be
saved with a valuable subset. The selected subset of features
used to represent the classification function influences several
aspects of classification, including the time required to
learn a classification function, the accuracy of the learned
classification algorithm, the time-space cost associated with
the features, and the number of samples required for training.

Major depressive disorder (MDD) is considered to be a chronic, relapsing
and remitting illness, and early medical diagnosis is important for the
subsequent treatment process. Many patients (30-50%)
fail to respond to initial antidepressant treatment. 22
So there is a clear need for methods that select the right
treatment for the right patient. 23 Repetitive transcranial magnetic
stimulation (rTMS) has been proposed as an alternative, 24
with a less invasive and less painful treatment process compared
to electrical brain stimulation. 25 In light of the
"Personalized Medicine" perspective on depression,
both ACO and neuroimaging biomarkers have recently been
studied and point to promising results in aiding treatment
prediction using pre-treatment measures. 26-29
Studies have been conducted mainly with neurophysiological
EEG biomarkers 30,31 and functional neuroimaging biomarkers, 32,33
and have focused on the predictive effect of change
of frontal quantitative EEG (QEEG) cordance in theta and
delta frequency bands. In one study, 34 EEG data were analyzed to compare
normal subjects with subjects suffering from various mental
disorders. It was found that a change in delta or theta band
EEG power can be evaluated as a specific marker of brain
dysfunction. 35 A considerable number of applications underline
that the effects of antidepressant (AD) medication are physiologically
detectable in the EEG, and QEEG cordance is one of the promising
biomarkers used to predict the treatment response, which
has generated research interest. In addition to their valuable
contribution as biomarkers, EEG patterns can be reduced to an optimized
subset using ACO to minimize the number of features while
maximizing classification performance.

In one study, ACO was compared with other well-known
feature selection and projection techniques using two different
biosignal-driven applications. 36 ACO was used as a feature
selection method to classify hand motion surface electromyography
signals in another study. 27 Another feature
selection application used ACO for images from
the Mammographic Image Analysis Society database. 28 The ACO
method was also tested on one of the most important
biosignal-driven applications, the Brain Computer Interface
(BCI) problem with 56 EEG channels. 29 In a pilot study,
the algorithm was first introduced to select genes relevant to
cancers; multilayer perceptron neural network and
support vector machine classifiers were then used for cancer
classification. 37 The main goal of clinical research in
MDD is predicting the response of MDD subjects to rTMS
therapy using their pre-treatment QEEG cordance and
enhancing diagnostic accuracy. These are crucially
important for proper medical treatment and for slowing down
the progress of the illness. An artificial neural network (ANN) based
model combined with an optimization algorithm was designed as a tool
to reduce the number of features while increasing the
prediction accuracy.

In this paper, an artificial intelligence approach combining
ACO and BPNN is proposed to classify MDD subjects as
responder (R) or non-responder (NR) to rTMS treatment using
the most relevant features.

METHODS 
Participants 

The research was conducted in Neuropsychiatry Istanbul
Hospital to assess the value of QEEG in classifying
MDD subjects as R or NR before rTMS treatment. The study
was formally approved by the local Medical Research
Ethics Committee. This study was based on an open-label
design. Patients who were willing to participate first visited a
psychiatrist in order to assess whether they met the inclusion criteria. All
subjects were free of psychotropic medication for at least two
weeks prior to enrollment. Subjects with nonpsychotic
depressive disorder as defined by International Statistical
Classification of Diseases and Related Health Problems (ICD-10)
criteria and with a 17-item Hamilton Depression Rating
Scale (HAM-D) score higher than 14 were eligible.

A total of 147 major depressive disorder subjects, resistant
to medication treatment, completed the protocols and were
examined for the study. Responder and non-responder groups
did not differ with respect to the psychopharmacological
treatment process. In order to minimize potentially confounding
effects of pharmacological withdrawal symptoms, all
subjects were on a monotherapy regimen and received concurrent
selective serotonin reuptake inhibitor (SSRI) antidepressant
medication during their 3 weeks (20 sessions) of rTMS
therapy. No patients were receiving lithium, mood stabilizers,
or benzodiazepines. A baseline clinical assessment was
conducted on the day prior to rTMS treatment by a psychiatrist
using the 17-item HAM-D. Patients were
assessed twice during the study using clinical,
neuropsychological and QEEG assessments. Routine laboratory studies
(complete blood count, chemistry, thyroid stimulating
hormone), urine toxicology screen, and electrocardiogram were
performed at study screening, and subjects were required to
be medically stable before entry. The exclusion criteria were:
organic brain disorders, pacemakers, psychotic symptoms, dementia,
delirium, substance-related disorders, cluster A or B axis II
disorders, treatment with electroconvulsive therapy
(ECT) in the prior six months, any past history
of craniotomy, skull fracture, seizures, or significant
neurological illness, and a past history of suicidal
intent, plan, or attempt.

EEG recordings 

During pre-treatment QEEG, subjects were instructed to
rest in the eyes-closed, maximally alert state, in a quiet room
with subdued lighting. The technicians monitored the QEEG
data during the recording and re-alerted the subjects every
minute as needed to avoid drowsiness. Electrodes were placed
with an electrode cap using 19 recording electrodes distributed
across the head according to the international 10-20 system
arrangement. Three minutes of eyes-closed EEG at rest were
acquired using a Scan LT EEG amplifier and electrode cap
(Compumedics/Neuroscan, USA) with a sampling rate of 250 Hz.
The 19 sintered Ag/AgCl electrodes were positioned according to the
10/20 International System with binaural reference. EEG
signals from 6 electrodes (Fp1, Fp2, F3, F4, F7, and F8)
in slow bands (delta and theta) were used. Raw EEG signal was filtered
through a band-pass filter (0.15-30 Hz) before artifact
elimination. Manually selected (minimum 2 minutes) artifact-free
EEG data with a minimum split-half reliability ratio of
0.95 and test-retest reliability ratio of 0.90 were used for
cordance calculations. Fast Fourier transform was used to
calculate absolute and relative power in each of two
non-overlapping frequency bands: delta (1-4 Hz) and theta (4-8 Hz).
Leuchter and colleagues first introduced the EEG cordance method
to provide a measure with face validity for the
detection of cortical elimination or interruption of sensory nerve
fibers. It was noticed that the EEG over a white-matter lesion
showed decreased absolute theta power but increased relative theta
power, a pattern termed "discordant". The EEG
cordance calculation therefore uses both absolute and
relative EEG power. Negative values of cordance (discordance)
indicate low perfusion or metabolism, while
positive values (concordance) indicate high
perfusion or metabolism. 38 A subsequent study
corroborated the method by comparing EEG cordance with
simultaneously recorded PET scans reflecting perfusion. 39
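The absolute and relative band power step described above can be illustrated with a minimal sketch. It is not the software used in the study: the function name `band_powers`, the half-open band edges, and the 1-30 Hz span used as the denominator of relative power are assumptions made for illustration, and the cordance computation itself is not reproduced.

```python
import numpy as np

def band_powers(signal, fs=250.0, bands=None, total_range=(1.0, 30.0)):
    """Absolute and relative FFT power per band for one EEG channel.

    total_range is an assumption: the frequency span used as the
    denominator of relative power is not stated in the text.
    """
    if bands is None:
        bands = {"delta": (1.0, 4.0), "theta": (4.0, 8.0)}
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                  # remove DC offset
    power = np.abs(np.fft.rfft(x)) ** 2 / x.size      # unnormalised periodogram
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    in_total = (freqs >= total_range[0]) & (freqs < total_range[1])
    total = power[in_total].sum()
    result = {}
    for name, (lo, hi) in bands.items():
        in_band = (freqs >= lo) & (freqs < hi)        # half-open: bands do not overlap
        absolute = power[in_band].sum()
        result[name] = (absolute, absolute / total)
    return result

# Synthetic two-minute test signal: strong 6 Hz (theta) plus weak 2 Hz (delta).
fs = 250.0
t = np.arange(int(120 * fs)) / fs
x = np.sin(2 * np.pi * 6.0 * t) + 0.2 * np.sin(2 * np.pi * 2.0 * t)
powers = band_powers(x, fs)
theta_abs, theta_rel = powers["theta"]
delta_abs, delta_rel = powers["delta"]
```

For this synthetic signal most of the relative power falls in the theta band, as expected from the amplitudes chosen.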

rTMS session procedures and ratings 

rTMS was applied using the Magstim Super Rapid2 stimulator
(Magstim Company, Whitland, UK) with a figure-of-eight
shaped Air Film Coil in all patients in an open-label
manner. The rTMS intensity was set at 100% of the motor
threshold, which was determined by visual inspection.
Stimulation was given to the left prefrontal cortex, deemed to be
located anterior to the cortical motor area of the abductor
pollicis brevis, for which the motor threshold was determined.
The treatment schedule was six days a week, from Monday
to Saturday, for three weeks. 25 Hz stimulation with a
duration of 2 seconds was delivered 20 times with 30-second
intervals. A full course comprised 1000 magnetic pulses.

Subjects were classified as "responders" if the HAM-D
score at three weeks showed at least a 50% improvement over
the pre-treatment HAM-D score. The HAM-D is a well-accepted
means of quantifying the severity of depression. For our
purposes, the HAM-D percentage change value is discretized
into two classes: responder (R)
when it is larger than or equal to 50%, and non-responder
(NR) otherwise. 40 Table 1 gives the HAM-D scores for each
group before and after rTMS treatment.
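The responder rule stated above reduces to a single threshold on the relative HAM-D change. The following sketch illustrates it; the function name `classify_response` is hypothetical, and the group means from Table 1 are used only as example inputs, since the rule was applied to individual subjects in the study.

```python
def classify_response(pre_hamd, post_hamd, threshold=0.5):
    """Label a subject R or NR from pre- and post-treatment HAM-D scores."""
    if pre_hamd <= 0:
        raise ValueError("pre-treatment HAM-D score must be positive")
    # Fractional improvement relative to the pre-treatment score.
    improvement = (pre_hamd - post_hamd) / pre_hamd
    return "R" if improvement >= threshold else "NR"

# Illustration only: group means from Table 1, not individual subjects.
label_r = classify_response(23.84, 9.13)     # about 62% improvement
label_nr = classify_response(22.93, 14.87)   # about 35% improvement
label_edge = classify_response(20.0, 10.0)   # exactly 50% counts as R
```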

BP neural network 

Artificial neural networks (ANNs) are widely used for solving
problems in analyzing biomedical signals because of their broad
applicability and their ability to learn complex and nonlinear
relations. ANNs are trained by example data sets instead
of rules. When used in diagnosis of neuromuscular disorders,
ANNs are not affected by factors such as human fatigue,
emotional states, and habituation. They are also capable of rapid
identification, analysis of conditions, and diagnosis in real
time. 41 There are various types and architectures of neural
networks varying fundamentally in the way they learn; the
details are well documented in the literature. 42,43

Table 1. Hamilton Depression Rating Scale (HAM-D) scores of R
and NR subjects before and after repetitive transcranial magnetic
stimulation treatment

                              Responder      Non-responder
Number of subjects            90             57
Pre-treatment HAM-D (±SD)     23.84±3.57     22.93±3.66
Post-treatment HAM-D (±SD)    9.13±1.93      14.87±3.11

BP neural network is a typical multilayer feed-forward
network trained according to the back propagation algorithm. BP
neural network uses a parallel distributed processing approach
to handle both qualitative and quantitative knowledge. It has
strong robustness, fault tolerance and adaptability and can
approximate any complex nonlinear relationship. 44
Because of these advantages, BP neural network is appropriate
for processing EEG data, which are possibly noisy,
unstable and nonlinear. In this study, a feed-forward
neural network trained by a backpropagation algorithm
is used for the modeling process. The network is based on a supervised procedure,
i.e. the network constructs a model based on examples of
data with known outputs. The architecture of the network is
a layered feed-forward neural network, in which the non-linear
elements (neurons) are arranged in successive layers, and
the information flows from the input layer to the output layer through
the hidden layer(s). 45 Input data were received from 6 electrodes
as QEEG cordance, 10 neurons were used in the hidden layer,
and a sigmoid transfer function was used in each neuron because
of its nonlinear behavior. In order to minimize the error
between the model output and a reference value, the mean
square error (MSE) is used as the cost function, given in equation 1.
The cost function is minimized by ACO.

$$J(w) = \frac{1}{N} \sum_{k=1}^{N} (y_k - z_k)^2 \tag{1}$$

where $y_k$ is the output of the model and $z_k$ is the reference
output.
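As a rough illustration of the network layout and cost function described above, the sketch below runs one forward pass of a feed-forward network with 12 inputs, 10 sigmoid hidden neurons and one sigmoid output, and evaluates the MSE of equation 1. The single output neuron, the random weights and the random toy data are assumptions; no training step is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    """One forward pass: input -> sigmoid hidden layer -> sigmoid output."""
    hidden = sigmoid(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2).ravel()

def mse_cost(y, z):
    """Equation 1: mean squared difference between outputs y and references z."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    return float(np.mean((y - z) ** 2))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 12, 10              # 12 features, 10 hidden neurons
X = rng.normal(size=(8, n_inputs))       # toy feature matrix (assumption)
z = rng.integers(0, 2, size=8)           # toy labels: 1 = R, 0 = NR (assumption)
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

y = forward(X, W1, b1, W2, b2)
cost = mse_cost(y, z)
```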

Feature selection with ACO algorithm 

Feature selection and dimension reduction are important
steps in pattern recognition tasks. In this study, although the
feature set was not excessive and gave satisfactory outcomes,
using the most informative features increased the
classification rate. Reducing the number of features also enabled the
classifier to learn a more robust solution and achieve better
generalization performance. In order to get an optimal
subset of features, the ACO algorithm was employed; the flow
chart of the optimization process is given in Figure 1.

The feature selection optimization steps are performed by
combining the optimization algorithm and the BPNN
classifier. Features selected by ACO are transferred to the
classifier, and the generated model is tested using the test set with the
assigned features. The performance of each ant is then
evaluated with the MSE to update the pheromone table. The
process continues until the stopping criterion, defined as an
optimal error value, is satisfied.
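The wrapper loop outlined above (ants propose feature subsets, a cost is evaluated for each subset, and the pheromone table is updated until a target error is reached) can be sketched in a simplified form. This is not the implementation used in the study: it keeps one pheromone value per feature, reinforces only the iteration-best ant, and replaces the BPNN test error with a hypothetical `toy_cost` function so that the example stays self-contained.

```python
import numpy as np

def aco_feature_selection(cost_fn, n_features, n_ants=20, n_iter=50,
                          rho=0.1, target_cost=0.0, seed=0):
    """Simplified pheromone-guided subset search (illustrative only)."""
    rng = np.random.default_rng(seed)
    tau = np.full(n_features, 0.5)          # pheromone = tendency to include a feature
    best_mask, best_cost = None, np.inf
    for _ in range(n_iter):
        masks, costs = [], []
        for _ in range(n_ants):
            mask = rng.random(n_features) < tau       # each ant builds a subset
            if not mask.any():                        # never evaluate an empty subset
                mask[rng.integers(n_features)] = True
            c = cost_fn(mask)
            masks.append(mask)
            costs.append(c)
            if c < best_cost:
                best_mask, best_cost = mask.copy(), c
        # Evaporate, then deposit on the features chosen by the iteration-best ant.
        it_best = masks[int(np.argmin(costs))]
        tau = (1.0 - rho) * tau + rho * it_best
        tau = np.clip(tau, 0.05, 0.95)                # keep some exploration
        if best_cost <= target_cost:                  # stopping criterion
            break
    return best_mask, best_cost

# Hypothetical stand-in for the classifier's test error: the fraction of
# features whose inclusion state differs from an arbitrary "informative" set.
informative = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)

def toy_cost(mask):
    return float(np.sum(mask != informative)) / mask.size

mask, cost = aco_feature_selection(toy_cost, n_features=12)
```

In a real wrapper, `cost_fn` would train the classifier on the training set with the selected columns and return its error on the test set.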

ACO is an iterative, probabilistic meta-heuristic for
finding solutions to combinatorial optimization problems. It
is based on the foraging mechanism employed by real
biological ants attempting to find a short path from their nest to
a food source. While foraging, the ants communicate
indirectly using their pheromone, which they use to mark their
respective paths and which attracts other ants. In the ACO
algorithm, artificial ants (agents) generate virtual pheromone
to update their path through the decision graph, i.e. the path
that reflects which alternative node an agent will choose. The
amount and density of pheromone an agent uses to update
its path depends on how good the solution is compared to
those found by competing agents of the same
iteration. Subsequent agents use the pheromone markings of
previous good agents as a means of orientation when making
their own selections to find the shortest path of all possible
alternatives. 46

Since finding the shortest tour closely resembles finding the shortest
path to a food source, ACO was first applied to the
optimization of the traveling salesman problem (TSP). 47 In such a
problem, a set of cities (nodes) is given and the distance
between each pair is known. The aim is to find the shortest path that
allows each city to be visited just once. Alternative paths are
generated on the basis of a probabilistic model and, in the
ACO metaphor, these paths are said to be constructed by
artificial ants walking on the graph that encodes the problem,
in which each vertex represents a city and each edge represents
a connection between two cities. Initial attempts at building an
ACO algorithm were not satisfying until the algorithm was
coupled with a local optimizer. 48 One problem is early
convergence to a less than optimal solution because too much
virtual pheromone is laid quickly. To avoid this problem,
pheromone evaporation is implemented. In other words, the
pheromone associated with a solution disappears after a
period of time. In the construction of a solution, ants select the
following city to be visited through a stochastic mechanism.
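When ant k is in city i and has so far constructed the partial
solution $s^p$, the probability of going to city j is given as:

$$p_{ij}^{k} = \begin{cases} \dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{c_{il} \in N(s^p)} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}} & \text{if } c_{ij} \in N(s^p) \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $N(s^p)$ represents the set of feasible nodes, and $\alpha$ and $\beta$ are
constants that control the relative importance of the pheromone $\tau_{ij}$
versus the heuristic information $\eta_{ij}$, which is given as:

$$\eta_{ij} = \frac{1}{d_{ij}} \tag{3}$$

where $d_{ij}$ is the distance between city i and city j.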
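The transition rule of equations 2 and 3 can be checked numerically with a short sketch. The function name `transition_probabilities`, the default exponents and the toy distances are assumptions chosen for illustration.

```python
import numpy as np

def transition_probabilities(tau_row, dist_row, feasible, alpha=1.0, beta=2.0):
    """Equation 2 for one ant standing in city i.

    tau_row and dist_row hold the pheromone and distance from city i to
    every city; feasible marks the cities that may still be visited.
    """
    tau_row = np.asarray(tau_row, dtype=float)
    dist_row = np.asarray(dist_row, dtype=float)
    feasible = np.asarray(feasible, dtype=bool)
    eta = np.zeros_like(dist_row)
    eta[feasible] = 1.0 / dist_row[feasible]          # equation 3, feasible cities only
    weights = np.where(feasible, tau_row ** alpha * eta ** beta, 0.0)
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no feasible city with positive weight")
    return weights / total

# Ant in city 0; cities 1 and 2 are still unvisited, city 3 was visited earlier.
tau_row = [1.0, 1.0, 1.0, 1.0]
dist_row = [0.0, 2.0, 4.0, 1.0]
feasible = [False, True, True, False]
p = transition_probabilities(tau_row, dist_row, feasible, alpha=1, beta=1)
```

With equal pheromone and exponents of 1, the nearer city (distance 2) is chosen with probability 2/3 and the farther one (distance 4) with probability 1/3.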



Figure 1. Design of proposed ant colony optimization based feature selection of
parameters. Flowchart elements: Ant Colony Optimization (initialize pheromone table;
construct a solution for each ant; selected feature subset; update the best ant table;
select best ant's features; end) and Classifier: back propagation neural network
(training set and test set; create model with training set using selected features;
test set with selected features; calculate the classification accuracy for the test set).
During each iteration, the pheromone values are
updated by all the m ants that have built solutions in that
iteration. The pheromone $\tau_{ij}$ associated with the edge
joining cities i and j is updated as follows:

$$\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k} \tag{4}$$

where $\rho$ is the evaporation rate, m is the number of ants,
and $\Delta\tau_{ij}^{k}$ is the quantity of pheromone laid on edge (i, j) by
ant k:

$$\Delta\tau_{ij}^{k} = \begin{cases} Q/L_k & \text{if ant } k \text{ used edge } (i, j) \text{ in its tour} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where Q is a constant and $L_k$ is the length of the tour
constructed by ant k. 29
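A minimal sketch of the update in equations 4 and 5 follows, assuming a symmetric TSP with closed tours (the last city connects back to the first). The function name `update_pheromone` and the parameter values are illustrative.

```python
import numpy as np

def update_pheromone(tau, tours, lengths, rho=0.5, Q=1.0):
    """Evaporate all edges, then let every ant deposit Q / L_k on its tour edges."""
    tau = (1.0 - rho) * np.asarray(tau, dtype=float)   # evaporation (new array)
    for tour, length in zip(tours, lengths):
        # Pair each city with its successor; the tour is closed.
        for i, j in zip(tour, tour[1:] + tour[:1]):
            tau[i, j] += Q / length
            tau[j, i] += Q / length                    # symmetric problem assumed
    return tau

# One ant, three cities, tour length 4: each used edge gets 0.5 + 1/4.
tau0 = np.ones((3, 3))
tau1 = update_pheromone(tau0, tours=[[0, 1, 2]], lengths=[4.0], rho=0.5, Q=1.0)
```

Shorter tours deposit more pheromone per edge, so edges belonging to good solutions become more attractive in later iterations.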

In this study, the value for each feature is represented by a
node, and the vectors between nodes can be considered as
the paths between nodes. Before an ant starts from a
randomly selected path, a hundred possible values were
proposed for each node to enable variety. Each
connection between nodes from the 1st to the 12th node via
various paths is a solution, created by the visit of an ant, to be
evaluated. A cost value is therefore assigned to the travelling ant
according to the path performance. Better ants will
track the trajectory more closely and will generate lower costs,
ensuring higher pheromone density on the path. The same loop is
repeated by other ants, and the feature optimization process is
not terminated until the desired fitness value is met. The
process is started by a single ant from a
randomly selected path, and the optimal path is found by colony
evolution. Travelling ants also deposit pheromones on the paths
they have passed to enable others to follow their trail. The
pheromones are updated and evaporated regularly to let other ants
find alternative paths and solutions. Local and global
pheromone updates were used to update
the pheromone table after each iteration and each tour, respectively.
After an ant completes an iteration, it updates the local pheromone
table as given in equation 6:



$$\tau(n)_{ij} = \tau(n-1)_{ij} + \frac{0.01\,\theta}{J} \tag{6}$$

where $\tau(n)_{ij}$ is the pheromone value between nodes (i) and (j)
at the nth iteration,

$\theta$ is the general pheromone updating coefficient,

$J$ is the cost function for the tour travelled by the ant.

Each tour comprises a hundred iterations, after which the
global pheromone update process starts. Pheromones of the
paths belonging to the best and worst ants of the tour are updated
as given in equations 7 and 8, respectively:

$$\tau(n)_{ij}^{best} = \tau(n)_{ij}^{best} + \frac{\theta}{J_{best}} \tag{7}$$

$$\tau(n)_{ij}^{worst} = \tau(n)_{ij}^{worst} - \frac{0.3\,\theta}{J_{worst}} \tag{8}$$

where $\tau(n)_{ij}^{best}$ and $\tau(n)_{ij}^{worst}$ are the pheromones of the paths
followed by the ant in the tour with the lowest ($J_{best}$) and
highest ($J_{worst}$) cost value in one iteration, respectively. The
pheromone evaporation, given in equation 9, decreases the
pheromone density of the visited paths to let the ants visit
low density paths, assuring diversity.

$$\tau(n)_{ij} = \tau(n)_{ij}^{\lambda} + \left[\tau(n)_{ij}^{best} + \tau(n)_{ij}^{worst}\right] \tag{9}$$

where $\lambda$ is the evaporation constant.

RESULTS 

In this study, an up-to-date swarm intelligence method,
ACO, was employed as the feature selection algorithm for 12
inputs, and then a BP neural network was used to classify 147
subjects as responder or non-responder. 6-fold cross-validation (CV) was
performed to train and test the classifier with stratified
sampling. The hybrid approach, combining the BPNN classifier
with the ACO algorithm, was significantly affected by the
number of selected features, and feature selection contributed to the
classification performance. The classification
results before and after the feature selection process are given as
overall accuracy, sensitivity and area under the Receiver
Operating Characteristic (ROC) curve in Table 2, and the
ROC curves for the compared approaches are plotted in Figure 2.

Throughout the classification process, the point defined by the
true positive rate (TPR) and false positive rate (FPR)
at each threshold is plotted to form the ROC curve. Each point
on the ROC curve represents a sensitivity/(1-specificity) pair
corresponding to a particular decision threshold. Depending
on the classification performance, the relative changes of
TPR and FPR may differ, causing sharp transitions
between cut-off points in the ROC curve.
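The construction of the ROC curve described above, one (FPR, TPR) point per decision threshold followed by the area under those points, can be sketched as follows. The labels and scores are made-up values for illustration, not data from the study, and the function names are hypothetical.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return FPR and TPR at every distinct score threshold, highest first."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Start above the highest score so the curve begins at (0, 0).
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    tpr = np.array([(scores[labels] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC points by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

labels = [1, 1, 0, 1, 0, 0]                 # 1 = responder (toy data)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # classifier outputs (toy data)
fpr, tpr = roc_curve_points(labels, scores)
auc = auc_trapezoid(fpr, tpr)
```

For these toy values, 8 of the 9 responder/non-responder pairs are ranked correctly, so the area is 8/9.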

After the frequency band and channel selection phase, the
ACO algorithm was used to reduce the feature set by
considering the classification error as the cost function. The contribution
of the feature selection process to the accuracy is quite satisfactory.
The neural network classified R subjects with 91.83% overall accuracy
(the percentage of examples classified correctly) and 95.55%
detection sensitivity. Area under the ROC curve (AUC)
was also used to evaluate the performance of the ACO algorithm.
The AUC value increased from 0.8531 to
0.911 after feature selection. The optimization algorithm selected
Fp1, Fp2, F7, F8, and F3 for the theta frequency band and
eliminated 7 features, reducing the feature set from 12 to 5.

Table 2. Repetitive transcranial magnetic stimulation treatment
responder results using ant colony optimization (ACO)

Feature selection method    Number of features    Accuracy (%)    Sensitivity (%)    AUC
None                        12                    80.25           84.44              0.8531
ACO algorithm               5                     91.83           95.55              0.911

AUC: area under curve

Figure 2. Receiver operating characteristic curves for back propagation
neural network and back propagation neural network with
ant colony optimization (ACO). BPNN: back propagation neural
network.
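The relation between the reported accuracy and sensitivity and the underlying classification counts can be made explicit with a short worked example. The confusion counts below are hypothetical: they are not reported in the text and were chosen only because, with 90 responders and 57 non-responders, they agree with the reported percentages up to rounding.

```python
def accuracy_and_sensitivity(tp, fn, tn, fp):
    """Overall accuracy and sensitivity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total        # fraction of all subjects classified correctly
    sensitivity = tp / (tp + fn)        # fraction of responders detected
    return accuracy, sensitivity

# Hypothetical counts (assumption): 86 of 90 responders and
# 49 of 57 non-responders classified correctly.
acc, sens = accuracy_and_sensitivity(tp=86, fn=4, tn=49, fp=8)
```

Under these assumed counts, accuracy is 135/147 (about 91.8%) and sensitivity is 86/90 (about 95.6%).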

Since cordance values are correlated with regional cerebral
blood flow, findings with this measure could be interpreted
within the same conceptual framework as other functional
neuroimaging studies demonstrating an abnormal pattern of
metabolism or perfusion in the prefrontal cortex and the
anterior cingulate in depressed patients. Moreover, frontal
electrical activity in the theta frequency band has been associated
with the function of these structures, and previous research
has linked pretreatment theta activity of the anterior
cingulate with clinical response. 23,26,49 The results of our study
support previous clinical research and focus on the
prefrontal region and theta frequency band for MDD patients.

DISCUSSION 

The high dimensionality of QEEG data, caused by the
use of a large number of electrodes and long periods of task
time, is one of the drawbacks in QEEG studies. Evolutionary
approaches are alternatives to conventional
dimension reduction methods, with the advantage of
not requiring the entire recording sessions for operation.

ACO is an evolutionary method that achieves
performance through evaluation of several generations of possible
solutions. Optimizing the feature set enables the classifier to
work with a reduced data set. Treatment response
prediction is crucially important for proper clinical treatment
and medical research, and various classification methods have been
proposed in the literature.

This paper combines a machine learning technique
and a meta-heuristic approach to classify subjects using
pre-rTMS treatment data. The ant colony optimization
algorithm was first used to select features relevant to MDD
subjects, and the ANN classifier was then used
for classification. Experimental results show that selecting
features by using ACO can improve the accuracy of the classifier.

Similar studies evaluating the performance of
combinations of modeling approaches and feature selection
techniques generated quite satisfactory outcomes, which underline the
validity and reliability of the method used in this study.
Feature selection of EEG signals for schizophrenic patients, 50
Alzheimer's patients, 51 depression patients 52 and patients suffering
from epilepsy 53 has also been studied, contributing to the
combination of optimization algorithms and neural networks to
increase the classification performance.

Various studies have also used ACO as a feature selection method
on biomedical data. 54,55 The machine learning
paradigm has been applied in a study using an ANN fed with EEG
data to differentiate three classes of subjects: those with
schizophrenia, those with depression, and healthy subjects. 56
Combining various biomarkers, statistical methods were also used
to predict treatment results. 23,40,57 In order to increase the
prediction performance, various feature selection methods were
proposed for multi-channel EEG data. 58,59 Some other studies
underlined the performance of ACO as a feature selection
method compared with principal component analysis, genetic
algorithm, random tree generation and differential evolution
methods. 27,28,36,54,60

Although the proposed ACO feature selection algorithm
improves the classification accuracy of the ANN, it still needs
further investigation with other types of classifiers. Using other
feature selection algorithms or classifiers to compare the
performance of each approach is important to underline the
validity and versatility of the designed combination. The
results show that the approach is suitable and promising for biological data
classification and is thus applicable
to clinical studies requiring diagnostic results.